Data Science Interview Problems 


Questions 
A.5.1. Estimation 


What’s an estimate for how many mini shampoo bottles are used by 
all the hotels in the United States in a year? 





A.5.2. Combinatorics 





Imagine a grid like the one pictured above, with a mouse at the 
bottom-left corner of the grid. At the top-right corner is a piece of 
cheese. The mouse can travel only along the lines in the grid and 
would never move away from it. How many paths are there from the 
mouse to the cheese? 


A.5.3 Project that had the most impact 


What’s the project you worked on that had the biggest impact? 


A.5.4 Data surprises


Can you tell me about a time you found something in the data that 
surprised you? 





A.5.5 Previous job reflections 


What is the thing that you wanted to change most in your previous 
job that you couldn't? 





A.5.6 Senior person making a mistake based on data 


What would you do if you had calculations or results that conflicted 
with the previous results of a senior person in the company? Would 
you try to convince them that you were right, and if so, how? 


A.5.7 Disagreements with teammates 


Tell me about a time you disagreed with a teammate. What was it 
about, and what did you do? 





A.5.8. Difficult problems 


What do you do when you don’t know how to solve a data science-related
problem?





A.5.9 Working with Git 


Can you talk about a time when you used Git to collaborate on a 
project? 





A.6.0 Technology decisions 


Given a totally blank slate, how do you pick your tech stack?


A.6.1. Frequently used package/library


What’s an R package or Python library you use frequently, and 
why? 


A.6.2. R Markdown or Jupyter Notebooks 


What is an R Markdown file or Jupyter Notebook? Why would you 
use an R Markdown file or Jupyter Notebook over an R or Python 
script? When is a script better? 


A.6.3 When should you write functions or packages/libraries? 


At what point should you make your code into a function? When 
should you turn it into a package or library? 


A.6.4 Types of joins 


Explain the difference between a left join and an inner join.


A.6.5 Loading data into SQL 


What are some different ways you can load data into a database in 
the first place, and what are the advantages and disadvantages of 
each? 


A.6.6 SQL query 


Here is TABLE_A from a school, containing grades from 0 to 100 
earned by students across multiple classes. How would you 
calculate the highest grade in each class? 


Grades Data

Class        Student             Grade
Math         Nolis, Amber        100
Math         Berkowitz, Mike     90
Literature   Liston, Amanda      97
Spanish      Betancourt, Laura   93
Literature   Robinson, Abby      93


A.6.7 Data types 


What disadvantages are there to storing a column of dates as 
strings in a database? In SQL, for example, what if we stored a 
column of dates as a VARCHAR(MAX) instead of DATE? 


A.6.8 Statistics terms 

Explain the terms mean, median, and mode to an eight-year-old.


A.6.9 Explain p-value


Can you explain to me what a p-value is and how it’s used?


A.7.0 Interpreting regression models 


How would you interpret these two regression model outputs, given 
the input data and model? This model is on a dataset of 150 
observations of 3 species of flowers: setosa, versicolor, and 
virginica. For each flower, the sepal length, sepal width, petal 
length, and petal width are recorded. The model is a linear 
regression predicting the sepal length from the other four variables. 


Input data to the model

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3            1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa
4          4.6         3.1          1.5         0.2 setosa
5          5           3.6          1.4         0.2 setosa


Model call

model <- lm(Sepal.Length ~ …


Output 1

term              estimate std.error statistic  p.value
<chr>                <dbl>     <dbl>     <dbl>    <dbl>
(Intercept)          2.17     0.280       7.76 1.43e-12
Sepal.Width          0.496    0.0861      5.76 4.87e- 8
Petal.Length         0.829    0.0685     12.1  1.07e-23
Petal.Width         -0.315    0.151      -2.08 3.89e- 2
Speciesversicolor   -0.724    0.240      -3.01 3.06e- 3
Speciesvirginica    -1.02     0.334      -3.07 2.58e- 3


Output 2

variable        value
<chr>           <dbl>
r.squared       0.867
adj.r.squared   0.863
sigma           0.307
statistic       188
p.value         2.67e-61
df              6
logLik          -32.6
AIC             79.1
BIC             100
deviance        13.6
df.residual     144


A.7.1 Favorite algorithm


What’s your favorite machine learning algorithm? All right, can you 
explain it to me? 


A.7.2. Model behavior 


Given a model you developed, how would you design a metric to 
evaluate it from the end user’s perspective? How would you decide 
what errors are acceptable? 


A.7.3. Experimental design 


You're developing an app and want to determine whether a newly 
designed layout would be better than your current one. How would 
you structure a test to pick the better app layout? 


A.7.4 Flaws in experimental design 


Assume that you’ve done an A/B test to select a better app layout;
what is a case in which you might not want to implement the new
layout despite seeing a statistically significant improvement in the
metric you’re testing?


A.7.5 Bias in sampled data 


What types of biases should you be aware of when using sample 
data? How can you tell whether a sample is biased? 


A.7.6 Deploying a new model 


You developed a new model that performs better than your old 
model currently in production. How do you determine whether you 
should switch the model in production? How do you go about it? 


Answers and Notes


A.5.1. Example answer 


I estimate the number of bottles by using the following formula:


number of hotels in the US * average number of rooms per hotel * 1 shampoo
bottle per occupied room per night * average room utilization * 365 days per
year = number of shampoo bottles per year


Then I estimate the numbers in the formula:


• Number of hotels in the United States: If I assume that there is a hotel for every
5,000 people in the country, and there are around 300 million people in the country,
that’s 60,000 hotels.


• Number of rooms per hotel: Fifty seems like a decent guess for the average
number of rooms in a hotel from the hotels I’ve stayed in.


• Average room utilization: Because hotels need to be profitable, I'll guess that
each night, a room has an 80 percent chance of being occupied.


This makes the formula 60,000 * 50 * 1 * 0.8 * 365 = 876 million bottles.
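
The arithmetic in the formula above can be checked with a short Python sketch. Every input is one of the rough guesses from the answer, not measured data, and the variable names are only illustrative.

```python
# Fermi estimate of mini shampoo bottles used by US hotels per year.
# All inputs are the rough guesses from the example answer, not real data.
us_population = 300_000_000
people_per_hotel = 5_000             # guess: one hotel per 5,000 people
rooms_per_hotel = 50                 # guess: average hotel size
bottles_per_occupied_room_night = 1
occupancy_rate = 0.8                 # guess: 80 percent of rooms occupied
days_per_year = 365

hotels = us_population // people_per_hotel
bottles_per_year = (hotels * rooms_per_hotel * bottles_per_occupied_room_night
                    * occupancy_rate * days_per_year)

print(f"{hotels:,} hotels -> {bottles_per_year:,.0f} bottles per year")
# 60,000 hotels -> 876,000,000 bottles per year
```

Keeping each guess in its own named variable makes it easy to show the interviewer how the estimate moves when one assumption changes.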


A.5.1. Notes 


The solution to this question is to come up with a formula for the number you're trying to 
estimate and take a guess at the numbers to put in the formula. There are many, many 
versions of this question, from “How many ping-pong balls can fit into a Boeing 747 
airplane?” to “How many pianos are there in France?” The interviewer is looking to see 
whether you can come up with a formula that makes some sense and that your logic for 
guessing each of the numbers in the formula makes sense. There’s almost no chance 
that you'll get the number close to right during the interview (we have no idea whether 
50 is a good guess for the average number of hotel rooms, for example), but that’s not 
important. 


There isn’t much you can do to prepare for these questions except practice the
improvisational component of coming up with formulas and estimates on the fly.


A.5.2. Example answer 


To get to the cheese, the mouse has to move one space along a horizontal line in the
grid nine times and one space along a vertical line in the grid six times, in any order
(because the grid is 9x6). Let’s call a horizontal move H and a vertical move V.

Then any string with 9 Hs and 6 Vs is a valid path from the start to the end. Going straight
up and then to the right, for example, would be VVVVVVHHHHHHHHH. There are 15
factorial (15!) ways to arrange 15 distinct characters, which are called permutations, but
because 6 of them are the same letter (V) and 9 of them are the same letter (H), we
have to remove all the duplicate arrangements. We can remove them by counting how
many duplicates there are of each. The Vs are duplicated 6! times (the number of ways
they can be arranged), and the Hs are duplicated 9! times. That means that the answer
is 15!/(6!)/(9!), or 5,005 paths.
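
As a sanity check on the counting argument, here is a minimal Python sketch that computes the answer two ways: with the factorial formula and by counting paths intersection by intersection. It assumes, as the question states, that the mouse only ever moves right or up; the function names are made up for this illustration.

```python
from math import comb


def count_paths_formula(h_moves: int, v_moves: int) -> int:
    """Choose which of the (h + v) steps are horizontal: (h + v)! / (h! * v!)."""
    return comb(h_moves + v_moves, h_moves)


def count_paths_dp(h_moves: int, v_moves: int) -> int:
    """Count paths row by row: each intersection is reached from the left or from below."""
    # ways[x] is the number of paths to column x in the current row.
    # Along the bottom row there is exactly one path to every intersection.
    ways = [1] * (h_moves + 1)
    for _ in range(v_moves):
        for x in range(1, h_moves + 1):
            # paths arriving from below (old ways[x]) plus from the left (new ways[x - 1])
            ways[x] += ways[x - 1]
    return ways[h_moves]


print(count_paths_formula(9, 6))  # 5005
print(count_paths_dp(9, 6))       # 5005
```

The second function is useful in an interview if you do not remember the formula: it only relies on the observation that every intersection is entered from the left or from below.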


A.5.2. Notes 


This question is a really hard one to answer. First, it’s hard to know the right answer. If 
you've somehow studied the field of combinatorics, you may know it; otherwise, it’s hard 
to suddenly realize that you can think of the problem as arranging paths. Even if you 
see that way of formulating the problem, you may not know how to count the number of 
solutions. 


Second, even if you know the answer, it’s hard to give it in a way that clearly explains 
the problem and solution without being verbose. You can’t assume that everyone knows 
terms such as permutation, yet if you were to explain it all, you’d spend too much time 
on it. 


Finally, there really is no way to study for this question. There are so many 
combinatorics questions that you can’t have answers for all of them prepared in 
advance. Your best bet for questions like these is to explain your thought process and 
how you might approach the problem. If the interviewer is putting a lot of weight on 
questions like this one, that’s a red flag. 


A.5.3 Example answer 


In my last job, I was brought in to build an online experimentation, or A/B testing,
analytics system. The company was interested in starting to run experiments and had
an engineer who could implement them, two growth marketers who could come up with
the ideas and lay out the changes, and a manager, but they needed a way to
understand what the results of the experiment were.


When I started, I analyzed each experiment individually in R. But I knew that this wasn’t
the best system: it meant that the team needed me to run the scripts to see the results
and that I was duplicating work across analyses.


That’s what led me to build an internal dashboard to monitor experiments. This 
dashboard included not only the results of each experiment, such as the percentage of 
people who registered or subscribed in the control versus the treatment group, but also 
health checks to make sure that the experiment was running as expected and that the 
results could be trusted. With this dashboard, anyone at the company could see the 
most up-to-date results. 


By the time I left, this dashboard was being used for all the experiments being run
across five teams. Thanks to the work I did with the rest of the experimentation team,
almost every feature the company launches now is first tested as an experiment to 
measure whether it has any positive impact. 


A.5.3 Notes 


For this answer, if you have done any data science projects for a company, you want to 
use one of those instead of a non-data science project. On the other hand, if you’ve 
done data science projects only for personal use or class assignments, you can 
highlight another project. The biggest thing here is to focus on the impact on the 
business. Saying “I built a model with 90% accuracy!” is not what they’re looking for; 
they want to understand how someone used the model, tool, or analysis you built and 
why it mattered. 


A.5.4 Example answer 


My previous job was at a company that made money with subscriptions. I worked on
experiments there, and when I started, I would calculate the subscription rate in an
experiment as the percentage of people who entered who later subscribed. Although
that sounds good, it turned out that people had subscriptions starting in the future!


After talking with the data scientist who owned the subscription data, I found out that
these subscriptions with future start dates were subscriptions that someone had
paused. Take a user with a monthly subscription starting in September. In October,
instead of renewing or canceling her subscription, she could choose to pause it, not
paying and losing access for two months, but then having the subscription start again in
December. In this case, she’d have two rows in the subscription table: one for the
September-to-October subscription and one starting in December.


For my use case, I didn’t want to count those subscriptions that would start because
they would become unpaused; I wanted only subscriptions that someone was actively
choosing!


I learned two lessons: that I should never make assumptions about the data and that I
may need to customize a data source for my needs. I’d assumed that it would be
impossible for subscriptions to start in the future, so I hadn't checked for that. When I
realized this issue, I didn’t overwrite the original data, because other people still needed
to know about subscriptions that had been set to start in the future. Instead, I made my
own table that counted only new subscriptions.


A.5.4 Notes 


In this answer, we used an example in which we were surprised by what is essentially a 
data quality issue for our use case. But you could talk instead about a time your intuition 
just didn’t match the results, for example, an exploratory data analysis you did of the 
Reddit subthread on data science, when you thought the word count of posts would 
correlate positively to the number of comments, but it turned out that there was a 
negative correlation. You also want to make sure that you explain why you held your 
initial assumption. 


This question is testing whether you think about your data before simply diving into it. 
It’s also testing that you don’t just try to confirm your initial hypothesis but let yourself be 
surprised by results and adapt to the new information. 


A.5.5 Example answer 


I found that at my last company, there were real struggles with communication. The
leadership team was constantly asking people to be more open and express their 
concerns, but it wouldn’t happen. My theory is that it was due to the leaders themselves 
not being open; they would constantly tell us everything was going great when we knew 
that there were problems. 


The thing I wanted to change the most was to have the leadership open up to us. If they
would express their own struggles and concerns more, it would have made it easier for 
the more-junior employees to open up and made for a better working environment. 


A.5.5 Notes


This question is tricky. You need to show that you understood your previous working 
environment well enough to have a proposed improvement for it, but you need to do it 
while making it sound as though you had a good relationship with your previous 
employer. 


You could list lots of different types of changes, such as technical changes, team 
dynamic changes, and product changes. The more meaningful a change you can list, 
the better (so not “I wish we had free soda”). It’s also great if you can reflect on why that
change didn’t happen (“I wish we had been using a modern language like R or Python, 
but we were using SAS because of all the legacy products we maintained”). Explaining 
why the change hadn’t happened shows that you put thought into the limitations of the 
environment. 


Avoid insulting your previous employer (“Can you believe they were unintelligent 
enough to use FORTRAN!?”). You don’t want to give the impression that someday, 
you'll leave your next company and insult them too. Be respectful of the work your past 
employer has done, even if it’s ultimately lacking. 


A.5.6 Example answer 


First, I'd ask myself whether this result is important enough to bring up. If it was off by a 
small percentage, but we'd still make the same decision with the new results, or if the 
previous results were never used for anything, I might let it go.


If not, I’d start by trying to understand the motivations and goals of the other person. 
Suppose that this person was the vice president of sales and they had done an analysis 
showing that each salesperson they hired brought in more than twice their salary in 
sales. Then they used this analysis to justify hiring five more people for the team. If I
show that each salesperson actually brings in less than their salary, that could 
jeopardize the whole sales department. People would have a lot riding on the outcome 
so it would be important to be careful. 


I’d set up a meeting with them. By understanding the situation, I could make an
educated guess about how they would react. If the results conflicted because they had
an error in their analysis or if the result was fundamental to their business, I’d expect
that they might be defensive and try to find flaws in my analysis, so I'd prepare
emotionally and triple-check my results. I'd try to find a solution that lets them save face,
pivot their strategy, and put the business in the right direction.


In the worst case (they didn't listen or offer any valid reasons why the new results were
wrong, and I think the new results are vital to the business), I’d work with my manager
to come up with a strategy to have the new results shared and acted on. Unfortunately,
sometimes people won’t end up agreeing with your new analysis, and the focus needs 
to shift from convincing them to finding a way to accomplish your goals and limit the 
impact of a wrong analysis. 


A.5.6 Notes 


This question seeks to understand how you'd handle conflict with someone more senior.
Although many answers exist, some like “I would email everyone in the company to 
publicly talk about how wrong they were” or “I’d always handle it this way no matter the 
situation because the only thing that matters is the data” would definitely be an issue. 
Academics especially can struggle with handling conflict in businesses; in academia, 
talks can be a contest of who in the audience can find the most flaws in the research 
and tear down the arguments. In industry, on the other hand, you need to be able to 
share your viewpoint and help the business make right decisions while balancing other 
factors and understanding the nuances of different situations. The interviewer is looking 
for signs that you’ve successfully resolved disagreements before, so exhibit that 
experience as much as you can. 


A.5.7 Example answer 


One time, I was working with a product manager on an experiment in which we'd set a
run time of two weeks based on a power calculation. Four days in, they wanted to stop 
the experiment early and fully launch it because the p-value was .04 on the main 
success metric. But I knew this could be an artifact of peeking: by checking your results
every day to see whether the p-value dips below .05 and stopping if it does, you greatly 
increase your false-positive rate. I also knew that the product manager was highly
incentivized to have a successful experiment: one of the main metrics they were 
evaluated on was the incremental revenue gained through successful experiments. 


In this case, I focused on making sure that I knew where they were coming from and
asking questions. I brought us back to our shared goals: making the company as
successful as possible. I walked them through a simple example from the webcomic
xkcd to help them develop the intuition about why stopping early could be a problem:
that if you check whether 20 different colors of jelly beans were associated with acne, 
even if none was, by chance a statistical test would likely “find” that one of them was. 
(Here is a link to the comic, for reference: https://xkcd.com/882/.) In the same way, we 
were chasing statistical ghosts and were liable to trick ourselves into thinking that we 
had a positive impact when we didn't. In the end, they agreed to keep the experiment 
running for the planned two weeks. 
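
The effect of peeking described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the analysis from the anecdote: it simulates A/A tests (no real difference between groups) with a made-up 10 percent conversion rate and 100 users per group per day, applies a two-proportion z-test after each of 14 days, and compares how often a result looks "significant" at any daily peek versus only at the planned end.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value using the normal approximation."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def run_experiment(rng, days=14, users_per_day=100, rate=0.1, alpha=0.05):
    """Simulate one A/A test.

    Returns (significant at any daily peek, significant on the final day).
    Both groups share the same true rate, so every "significant" result
    is a false positive.
    """
    conv_a = conv_b = n_a = n_b = 0
    any_peek_significant = False
    p = 1.0
    for _ in range(days):
        conv_a += sum(rng.random() < rate for _ in range(users_per_day))
        conv_b += sum(rng.random() < rate for _ in range(users_per_day))
        n_a += users_per_day
        n_b += users_per_day
        p = two_sided_p_value(conv_a, n_a, conv_b, n_b)
        if p < alpha:
            any_peek_significant = True
    return any_peek_significant, p < alpha


rng = random.Random(882)  # fixed seed so the run is repeatable
results = [run_experiment(rng) for _ in range(1000)]
peeking_rate = sum(peek for peek, _ in results) / len(results)
fixed_rate = sum(final for _, final in results) / len(results)

print(f"false positives when stopping at any significant peek: {peeking_rate:.1%}")
print(f"false positives when testing only at the planned end:  {fixed_rate:.1%}")
```

Because the final day is itself one of the peeks, the peeking rate can never be lower than the fixed-horizon rate; the simulation shows how much higher it gets when the check is repeated daily.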


This situation also led me to think more about how I could improve the experimentation
tool to make it easier for people to do the right thing. One experimentation platform I
know has a little circle that fills up more every day and turns into a check mark at the
end of seven days. This helped people run their experiments for at least a week, which 
is best practice. 


A.5.7 Notes 


This answer uses the STAR (situation, task, approach, result) approach. STAR is a 
classic framework for answering behavioral interview questions, as it provides a 
structure for the answer that is easy to follow. When thinking of a good example for this 
question, you want to find a situation that had a positive result, not “And then we never 
spoke again” or “I got him fired.” You also want the disagreement to be business- 
related, not “We disagreed about how to load the office dishwasher.” Interviewers are 
looking to see whether you can empathize with someone you disagree with and avoid 
bad-mouthing them or blaming them for your problem. 


A.5.8 Example answer 


For coding questions, Google is my friend. Often, an answer in the Stack Overflow 
question is the first result if I Google an error message or something like “How do I do
Latent Dirichlet allocation in R?” If I know what the function or package I want to use is,
but I’m not sure exactly how it works, I'll check for any documentation. 


But sometimes, I don’t know how to approach a problem. In those cases, I usually start
breaking down the problem, sometimes by writing the different components on the 
whiteboard. This helps me get down to the core issues, which may be ones I know how
to tackle, even if initially the whole problem seemed daunting. 


I also like the rule of spending 15 or 30 minutes on the problem myself (depending on
whether I feel that I’m making any headway) and then asking another data scientist at
my company for help. It’s my responsibility to try to figure it out on my own first, but it’s
also on me not to be stuck on something for a whole day when a colleague could have
helped me in a few minutes. When I reach out, I'll share what I’ve tried along with a
small, reproducible example to make it easier for the other person to see the issue 
(rather than send them hundreds of lines of code to parse). 


A.5.8 Notes 


Data science is a field in which you'll continually be learning and challenged by 
problems you’ve never seen before, so it’s important to develop a few strategies for 
getting unstuck. One thing this question is looking for is that you’ve developed 
strategies for outside the classroom setting, where you had an answer sheet, 
classmates, and a professor there to help you. You'll potentially need to tailor this 
answer to the company you’re talking to. If you say that your main strategy is asking 
your data scientist colleagues, and you’re interviewing to be the first data scientist, that 
answer will be a red flag. 


A.5.9 Example answer 


At my last job, I created an R package, funneljoin, with two co-workers during an
afternoon hackathon, using Git from the beginning. We spent the first hour pair- 
programming on one computer and then made a list of tasks to split among ourselves. 
Each of us created a different branch to work on our tasks, which enabled us to easily 
merge them back together in the end. Using Git ensured that we never accidentally 
overwrote someone else’s work. By committing early and often as we progressed, we 
knew that we could always go back if we decided that a previous way of implementing a 
feature was better. Finally, using GitHub meant that afterward, anyone in the company 
could download the package and start using it right away. 


Since that initial afternoon, I’ve remained the maintainer of the package. I continue to
use Git features like branches so that I can prototype features without merging them in
until they’ve been thoroughly tested. 


A.5.9 Notes 


If you’re asked this question and haven't collaborated using Git on a project before, you 
can talk instead about how you've used Git for a personal project. One thing this 
question is testing is whether you’ve used Git, and you can explain that you’ve used 
different features (such as branches or forking), even if just by yourself. You might add
how you would adapt your practices if you were collaborating with other people—by 
using branches more, for example, or by sticking to a consistent commit message 
structure. 


If you haven't used Git before, be honest about that in the interview. But we do 
recommend trying to learn it before having too many more interviews. 


A.6.0 Example answer 


This question is interesting to answer because it really depends on the project at hand. 
My decision for a tech stack relies primarily on balancing what would be the most 
straightforward for me to implement with what would be easiest for everyone else to 
work with. Let me give two examples of how I picked technical stacks and what I
learned from them. 


On one project earlier in my career, I had to develop a new product entirely from
scratch, and I was the only data scientist on my team. I chose to use the .NET stack
and F# because I was very familiar with them, and because I was so familiar, we were
able to get a working product out the door quickly. The downside was that because the 
F# language is so uncommon, when it came time to hire a data scientist to take it over, 
we couldn't find anyone who already had the required knowledge. In retrospect, using 
.NET and F# was not the right decision. 


On a more recent project, I was tasked with creating a machine learning API. I was in
the middle of an engineering team that worked with microservices, so I decided to
create a REST API in R as a Docker container. Although I hadn’t used Docker
containers before, I chose it because I knew that it would be easiest for the team to
maintain. From that project, I learned a lot about Docker and containers, and the work I
created was able to integrate well. 


A.6.0 Notes 


When interviewing candidates at T-Mobile, Heather Nolis usually has a person pick a 
project and describe the decisions they made; the decisions other people made; and 
how they would do things differently, knowing what they know now. Your interviewer 
may not ask you those things directly, but they’re all worth including anyway. 


Whatever answer you give, you'll want to include plenty of references to decisions
you've had to make in work you’ve done before (which can include side projects or 
coursework). The point of this question is to see how much thought you have put into 
choosing the right technology for the right project. Having chosen technological stacks 
that ended up being problematic is fine as long as you learned from the problem. In fact, 
having learned from things is even better than getting everything right the first time, 
because it shows that you can change. 


A.6.1 Example answer 


It’s not one package, but I really love the suite of packages that make up the tidyverse
in R. The packages get you all the way from reading in the data to cleaning it to 
transforming and visualizing and modeling it. 


I especially enjoy working in dplyr because, thanks to the connected package dbplyr, I
can write the same code whether I’m working with a local or remote table, as dbplyr
translates dplyr code to SQL. Using dbplyr at my last job meant I could stay in RStudio
for my entire workflow, even though all our data was stored in Amazon Redshift and
required SQL queries to access. I would use dbplyr to do summaries and filtering and
then pull the data down locally if I needed to do more complicated operations or
visualizations. 


Overall, I really like the philosophy of Hadley Wickham, a core tidyverse developer: that
the bottleneck when coding is often thinking time, not computational time, and that you 
should build tools that work seamlessly together and let you translate your thoughts into 
code quickly. 


A.6.1 Notes 


The interviewer shouldn't be looking for a specific answer here. Rather, they’re looking 
to see whether you (a) program enough in either language to have a frequently used 
package and (b) can explain how and why you use that package. This answer also 
gives the interviewer a sense of what kind of work you do day to day. Don’t forget to 
explain what the package does, especially if it’s a niche package. If another package is 
more widely used for the task, you may want to explain your reasoning for why you 
choose this package, as that shows your broader awareness of what alternatives are 
out there. Finally, don’t worry about picking the most “advanced” library, such as a deep 
learning library, to impress the interviewer. Ideally, this question is one of the easiest
and (potentially) most fun to answer, so don’t overthink it. 


A.6.2 Example answer 


I’m going to answer this in the case of R and R Markdown, but the basic idea is the 
same for Python and Jupyter Notebooks. R Markdown files are ways of writing R code 
that allow you to put text and formatting around the code. In a sense, they merge the 
code and results of an analysis with the narration and ideas of the analysis. By using an 
R Markdown file, you can have an analysis that is easier to reproduce than raw R code, 
with separate documentation on what the analysis was. Ideally, your R Markdown would 
be formatted so clearly that when you render the output file, you could hand the 
resulting HTML output, Word document, or PDF to a stakeholder. 


R Markdown files are great for reproducible analysis, but they're less useful when you’re 
writing code that you’re going to deploy or use in other places. Say you have a list of 
functions that you want to use in multiple other places (such as one to load the data
from a file). It might make sense to write an R script that creates all the functions and 
keep the script separate from an individual analysis. Or, if you wanted to use R with the 
plumber package to create a web API, you wouldn't want an R Markdown file for that. 


A.6.2 Notes 


This question is a check by the interviewer to see whether you have experience in doing 
reproducible analysis. Many people who use R or Python write scripts in ad hoc ways 
and don’t think about how they will share the results with others. By showing that you 
understand the point of R Markdown and Jupyter Notebooks, you show that you’re 
thinking about how to make your code more usable. If you haven’t used either R 
Markdown files or Jupyter Notebooks, definitely go try one of them out. 


Don’t worry about understanding both R and Python versions; either should be fine. 


A.6.3 Example answer 


In general, if I notice that I’m ever copying and pasting code, it’s probably a sign that I
should make a function out of it. If I need to run code on three different datasets, for
example, I should make a function and apply it to each one rather than copy the code
three times. The library purrr in R or list comprehensions in Python make it easy to 
apply a function many times. 
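
A minimal Python sketch of that idea follows. The three small lists are hypothetical stand-ins for real datasets, and `summarize` is a made-up helper, but the pattern is the one described above: write the logic once, then apply it with a list comprehension.

```python
from statistics import mean, median


def summarize(values):
    """Return the same summary for any list of numbers."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Three hypothetical datasets standing in for real ones
datasets = [
    [12, 15, 11, 19],
    [7, 9, 14],
    [20, 22, 18, 25, 30],
]

# One function plus a list comprehension replaces three pasted copies of the code
summaries = [summarize(values) for values in datasets]

for summary in summaries:
    print(summary)
```

If the summary logic later changes (say, adding a standard deviation), it changes in one place instead of three.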


I have found that packages and libraries are best when you have code that spans
multiple distinct projects on the team. At my current job, we have a lot of data that we
store in S3 but want to analyze locally. Rather than copying and pasting functions to
access the code in each project, I created a library that could be called from all of them.
The downside of libraries is that if you change them, you have to change all the projects 
that use the library, but for core functions, this approach is often worthwhile. 


A.6.3 Notes 


This question is somewhat of a softball question because it has a right answer: “as 
much as possible.” Generally, copying and pasting code over and over is a bad practice; 
a data scientist should make functions in such a way that the code is easier to read and 
understand. For that reason, as much as you can, add examples that show you 
understand the value of reusing code as functions or packages. Was there a time you 
made a function and reused it a lot? What about a library? Talk about those situations 
as much as possible. 


A.6.4 Example answer 


Joins are ways of combining data from two different tables—a left table and a right table 
—into a new one. Joins work by connecting rows between the two tables; a set of key 
columns is used to find data in the two tables that are the same and should be 
connected. In the case of a left join, every row from the left table appears in the 
resulting table, but rows from the right table appear only if the values in their key 
columns show up in the left table. In an inner join, however, rows from the left table 
and the right table appear only if there is a matching row in the other table. 


In practice, you can think of a left join as attaching data from the right table to the left, if 
it exists (such as using the right table as a lookup). An inner join is more like finding all 
the shared data and making a new table from only the pairs. 
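
To make the difference concrete, here is a small sketch that runs both joins with Python's built-in sqlite3 module. The orders and customers tables and their contents are invented for illustration; orders is the left table and customers is the right table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 11), (3, 99);
    INSERT INTO customers VALUES (10, 'Ana'), (11, 'Bo'), (12, 'Cy');
""")

# Left join: every order appears; the name is NULL (None) when no customer matches.
left_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Inner join: only orders that have a matching customer appear.
inner_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders AS o
    INNER JOIN customers AS c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
conn.close()

print(left_rows)   # [(1, 'Ana'), (2, 'Bo'), (3, None)]
print(inner_rows)  # [(1, 'Ana'), (2, 'Bo')]
```

Order 3 survives the left join with an empty name but drops out of the inner join, and customer 12 (who has no orders) appears in neither.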


A.6.4 Notes 


Janda likes this question as an early screener for more-junior roles because it isn’t a 
trick question and is important knowledge for a candidate to have. She finds that you 
can learn a lot from how the candidate chooses to answer. There are plenty of valid 
answers, from ones that are textbook-correct but not easy to understand to very 
simple-to-understand ones that miss edge cases. 


Notice that in our answer, we didn’t talk about any complexities from duplicate rows 
appearing in the data. It might be worthwhile to mention these complexities because 
they can affect the results, but more likely, they’re a distraction from the point you're 
trying to get across. 


A.6.5 Example answer 


There are many ways to load data into a database, primarily depending on where the 
data exists in the first place. If the data is in a flat file such as a CSV file, many SQL 
versions have programs to import the data. SQL Server 2017, for example, has an 
Import and Export wizard. These tools are easy to use but do not allow for much 
customization and are not easily reproducible. If the data is coming from a different 
environment, such as R or Python, there are drivers that allow for passing the data into 
SQL. An ODBC driver, for example, can be used along with the DBI package in R to 
move data from R into SQL. These methods are programmatic and more reproducible, 
but they require you to get the data into R or Python first. 
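
A rough Python analogue of the programmatic route might look like the sketch below. It uses the standard-library csv and sqlite3 modules as a stand-in for a real driver and database server, and the table name, column names, and CSV contents are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk; with a real file, use open(path, newline="").
csv_file = io.StringIO("class,grade\nmath,91\nmath,78\nart,85\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (class TEXT, grade INTEGER)")

# Read the rows into Python first, converting types along the way.
reader = csv.DictReader(csv_file)
rows = [(row["class"], int(row["grade"])) for row in reader]

# Parameterized inserts keep the load repeatable and safe from quoting problems.
conn.executemany("INSERT INTO table_a (class, grade) VALUES (?, ?)", rows)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM table_a").fetchone()[0]
print(loaded)  # 3
conn.close()
```

Because the whole load is a script, it can be rerun or scheduled, which is the reproducibility advantage over a one-off import wizard.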


A.6.5 Notes 


This question is really a test of whether you’ve had to load data into a database before. 
If you’ve done it before, it shouldn’t be too difficult to describe how you did it. If you 
haven't loaded data into a database before, that may signal to the interviewer that you 
don’t have enough experience. 


The part of the question about advantages and disadvantages checks whether you 
understand that different tools are better in different situations. Sometimes, using a GUI 
to upload data is a nice, easy solution when you have a single file. At other times, you'll 
want to set up a whole automated script to load the data continuously. The more you 
can show that you understand the nuances of what to use when, the better. 


A.6.6 Example answer 


Here is a query to find the highest grade in each class: 


SELECT CLASS, MAX(GRADE) AS MAX_GRADE 
INTO TABLE_B 
FROM TABLE_A 
GROUP BY CLASS 


This query groups the data by class and then finds the max grade within each group. It additionally 
saves the result into a new table (TABLE_B) so that the results can be queried later. 
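
If you want to try the grouping yourself, the sketch below runs an equivalent query through Python's built-in sqlite3 module. The sample grades and the MAX_GRADE column alias are made up for illustration, and because SQLite has no SELECT ... INTO, CREATE TABLE ... AS SELECT is swapped in to play the same role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE_A (CLASS TEXT, GRADE INTEGER)")
conn.executemany(
    "INSERT INTO TABLE_A VALUES (?, ?)",
    [("math", 91), ("math", 78), ("art", 85), ("art", 88)],
)

# SQLite equivalent of SELECT ... INTO TABLE_B: build the new table from the query.
conn.execute("""
    CREATE TABLE TABLE_B AS
    SELECT CLASS, MAX(GRADE) AS MAX_GRADE
    FROM TABLE_A
    GROUP BY CLASS
""")

result = conn.execute(
    "SELECT CLASS, MAX_GRADE FROM TABLE_B ORDER BY CLASS"
).fetchall()
conn.close()
print(result)  # [('art', 88), ('math', 91)]
```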


A.6.6 Notes 


This question is almost as simple as you can get for a SQL question; it’s testing whether 
you have a basic understanding of grouping in SQL. People typically mess this question 
up either by not seeing what to group by (in this case, the class variable) or by finding 
the question so easy that they overcomplicate it and miss the simple 
solution. If you are in an interview and a question seems to be too easy, it very well may 
be as easy as it seems. 


If that solution doesn’t seem to be obvious to you, now would be a good time to review 
how grouping variables work in SQL. 


Finally, the INTO TABLE_B line was totally optional, but it sets you up well for the next 
question. 


A.6.7 Example answer 


Having dates stored as strings instead of dates (such as storing March 20, 2019 as the 
string "03/20/2019") is a common situation in databases. Although you may not lose any 
information, depending on how you do it, you may experience performance hits. First, if 
the data isn’t stored as a DATE type, we couldn’t use the MONTH() function on it. We 
also couldn't do things like find the differences between two dates or find the minimum 
date in the column. 


This problem tends to happen a lot when you're loading data into a database or 
cleaning it. The earlier you can correctly format the data, the easier the analysis will be. 
You can fix these sorts of situations by using functions like CAST. That being said, 
if you're loading in data with hundreds of columns and there are plenty you'll never use, 
it may not be worth the time to fix all these issues. 
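
The same problem is easy to demonstrate outside a database. The sketch below, written in Python with invented sample dates, shows that the minimum of date strings is not the earliest date, and that parsing the strings into real dates makes the minimum, date differences, and month extraction work.

```python
from datetime import datetime

# Hypothetical dates stored as MM/DD/YYYY strings.
date_strings = ["03/20/2019", "12/01/2018", "01/15/2020"]

# Compared as text, "01/15/2020" comes first because "01" < "03" < "12".
earliest_as_text = min(date_strings)

# Parsed into real dates, the true earliest date is found.
dates = [datetime.strptime(s, "%m/%d/%Y").date() for s in date_strings]
earliest_as_date = min(dates)

# Date arithmetic and month extraction only work on the parsed values.
days_between = (dates[0] - dates[1]).days
month_of_first = dates[0].month

print(earliest_as_text)   # 01/15/2020
print(earliest_as_date)   # 2018-12-01
print(days_between)       # 109
print(month_of_first)     # 3
```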


A.6.7 Notes 


The problem of having data stored in an incorrect type is an extremely frequent 
occurrence. It doesn’t happen just in databases; it can also happen in flat files or in 
tables within environments such as R and Python. This question is checking whether 
you understand that when this happens, it’s generally bad, and when you see these 
situations, you generally should fix them. Being able to answer questions like this one 
should come naturally from doing data cleaning as part of data science projects that 
start with messy data. 


A.6.8 Example answer 


Mean, median, and mode are three different types of averages. Averages let us 
understand something about a whole set of numbers with just one number that 
summarizes something about the whole set. 


Suppose that we did a poll of your class to see how many siblings each person has. You 
have five people in your class, and let’s say you find that one person has no siblings, 
one has one, one has two, and two have five. 


The mode is the most common number of siblings. In this case, that’s 5, as two people 
have five siblings, whereas each of the other numbers belongs to only one person. 


To get the mean, you get the total number of siblings and divide that by the number of 
people. In this case, we add 0 + 1*1 + 1*2 + 5*2 = 13. You have five people in the class, 
so the mean is 13/5 = 2.6. 


The median is the number in the middle if you line them up from smallest to largest. 
We'd make the line 0, 1, 2, 5, 5. The third number is in the middle, and in our case, that 
means the median is two. 


We see that the three types of averages come up with different numbers. When do you 
want to use one instead of the other? The mean is the most common, but the median is 
helpful if you have outliers. Suppose that one person had 1,000 siblings! Suddenly, your 
mean gets much bigger, but it doesn’t really represent the number of siblings most 
people have. On the other hand, the median stays the same. 
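
The arithmetic above can be checked with Python's built-in statistics module. In this sketch, the outlier case assumes that one of the two people with five siblings is the one who has 1,000 instead; the text does not say which person it is.

```python
from statistics import mean, median, mode

# The class poll from the example: one person each with 0, 1, and 2 siblings, two with 5.
siblings = [0, 1, 2, 5, 5]
print(mean(siblings), median(siblings), mode(siblings))  # 2.6 2 5

# Assumption: one of the five-sibling classmates has 1,000 siblings instead.
with_outlier = [0, 1, 2, 5, 1000]
print(mean(with_outlier), median(with_outlier))  # 201.6 2
```

The mean jumps from 2.6 to 201.6 while the median stays at 2.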


A.6.8 Notes 


It’s unlikely that someone interviewing for a data science position won't know about the 
different types of averages, so this question is really testing your communication skills 
rather than whether you get the definitions right (although if you get them wrong, that’s a 
red flag). In our example, we used a simple scenario that an eight-year-old might 
encounter in real life. We recommend keeping the number of subjects small; you don’t 
want to get tripped up doing the math for the mean or median because you're trying to 
calculate them for 50 data points. If there’s a whiteboard in the room, it might be helpful 
to write out the numbers to keep track of them. As a bonus, you can add, as we did, 
when you might want to use one type of average instead of another. 


A.6.9 Example answer 


Imagine that you were flipping a coin and got 26 heads out of 50. Would you conclude 
that the coin wasn’t fair because you didn’t get exactly 25 heads? No! You understand 
that randomness is at play. But what if the coin came up heads 33 times? How do we 
decide what the threshold is for concluding that it’s not a fair coin? 


This is where the p-value comes in. A p-value is the probability that, if the null 
hypothesis is true, we’d see a result as or more extreme than the one we got. A null 
hypothesis is our default assumption coming in, such as no differences between two 
groups, that we're trying to disprove. In our case, the null hypothesis is that the coin is 
fair. 


Because a p-value is a probability, it’s always between 0 and 1. The p-value is 
essentially a representation of how shocked we would be by a result if our null 
hypothesis is true. We can use a statistical test to calculate the probability that, if we 
were flipping a fair coin, we would get 33 or more heads or tails (both being results that 
are as extreme as the one we got). It turns out that probability, the p-value, is .034. By 
convention, people use .05 as the threshold for rejecting the null hypothesis. In this 
case, we would reject the hypothesis that the coin is fair. 


With a p-value threshold of .05, we’re accepting that 5% of the time, when the null 
hypothesis is true, we’re still going to reject it. This is our false-positive rate: the rate of 
rejecting the null hypothesis when it’s actually true. 
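
The coin example can be worked through with nothing but the Python standard library. The sketch below computes the two-sided p-value in two ways: exactly from the binomial distribution, and with a normal approximation that uses a continuity correction. The answer above does not name the test behind its .034 figure; the approximation reproduces it, so treat that link as our inference rather than a stated fact.

```python
from math import comb, erfc, sqrt

n, heads = 50, 33

# Exact two-sided binomial p-value: 33 or more heads, or 33 or more tails.
upper_tail = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
p_exact = 2 * upper_tail

# Normal approximation with a continuity correction of 0.5 flips.
z = (abs(heads - n / 2) - 0.5) / sqrt(n * 0.25)
p_approx = erfc(z / sqrt(2))

print(round(p_exact, 3), round(p_approx, 3))  # 0.033 0.034
```

Either way, the p-value falls below .05, so the conclusion about the coin is the same.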


A.6.9 Notes 


This question is testing whether you both understand what a p-value is and can 
communicate the definition effectively. There are common misconceptions about the 
p-value, such as that it’s the probability that a result is a false positive. Unlike the 
averages question in the preceding section, it’s possible for someone to get this wrong. 
On the communication side, we recommend using an example to guide the explanation. 
Data scientists need to be able to communicate with a wide variety of stakeholders, 
some of whom have never heard of p-values and some who think that they understand 
what they are but don’t. You want to show that you both understand p-values and can 
share that understanding with others. 


A.7.0 Example answer 


Looking at the model summary results, it looks like a very good model; the R-squared is 
0.867, meaning that the predictors explain 86.7 percent of the variance in sepal length. 
The predictors are all significant at the p less than .05 level. I see that the wider the 
sepal and the longer the petal, the longer the sepal, whereas wider petals actually are 
associated with shorter sepals. Both the versicolor and virginica species have negative 
coefficients, which means that we’d predict those species to have a smaller sepal length 
than the setosa species. 


Suppose that we found a new flower, with a sepal width of 1, petal length of 2, petal 
width of 1, and that it was the virginica species. Our model would predict the 
sepal length to be the following: 2.17 + .496 * 1 + .829 * 2 - .315 * 1 - 1.02, which is 
about 3. Before using this model, though, I’d want to look at a few more diagnostics, 
such as whether the residuals are normally distributed, and I’d want to find a test set to 
see how it performs out of sample to make sure it’s not overfitting. 
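
The prediction above is just the intercept plus each coefficient times its input. The sketch below repeats the arithmetic in Python, using only the coefficients quoted in the answer; the dictionary key names are our own labels, not names from the model output.

```python
# Coefficients as quoted in the example answer; key names are illustrative labels.
coefficients = {
    "intercept": 2.17,
    "sepal_width": 0.496,
    "petal_length": 0.829,
    "petal_width": -0.315,
    "species_virginica": -1.02,
}

# The new flower: virginica is encoded as 1 on its indicator variable.
new_flower = {
    "sepal_width": 1,
    "petal_length": 2,
    "petal_width": 1,
    "species_virginica": 1,
}

predicted = coefficients["intercept"] + sum(
    coefficients[name] * value for name, value in new_flower.items()
)
print(round(predicted, 3))  # 2.989
```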


A.7.0 Notes 


The interviewer is looking for multiple things, and you can get points depending on how 
many you get right. In this case, the interviewer is checking whether you understand the 
model statistics (such as R-squared), as well as the estimates and their associated 
p-values. Although this information wasn’t explicitly asked for, in our answer, we added 
how we would use this model to predict the sepal length of a new flower. Finally, we 
added some information about the model we’d want to know before we started using it. 
This type of open-ended question is a good opportunity to hit what the interviewer is 
probably looking for and to add bonus information. Avoid trying too hard and ending up 
spending 20 minutes on a single question; show that you understand as many concepts 
as you can and then move on. 


A.7.1 Example answer 


My favorite machine learning algorithm is a recurrent neural network. I have been doing 
a lot of work with natural language processing lately, and recurrent neural networks are 
great models for classifying text quickly. 


Do you know linear regression? A neural network is like a linear regression except 
that you have groups of linear regressions and the output of one group of linear 
regressions is the input to the next group. By tying all these linear regressions together 
into layers of models, you can make predictions much more accurately. 


A recurrent neural network is a special case of a neural network that’s tuned for data 
that falls in sequences. In the case of natural language processing on a block of text, the 
output partway through a sequence of words is the input for the model of the next 
words. 
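
For readers who think in code, here is a toy sketch of that recurrence in plain Python. It is a single unit with hand-picked weights, not a trained model, and the tanh squashing step is an addition that the explanation above leaves out for simplicity. All names and numbers are illustrative assumptions.

```python
import math


def linear(x, weight, bias):
    """One 'linear regression' unit: a weighted input plus an intercept."""
    return weight * x + bias


def rnn_states(sequence, w_input=0.5, w_state=0.8, bias=0.0):
    """Run a one-unit recurrent layer; each step also sees the previous step's output."""
    state = 0.0
    states = []
    for x in sequence:
        # The previous output (state) feeds back in alongside the current input.
        state = math.tanh(linear(x, w_input, bias) + w_state * state)
        states.append(state)
    return states


# An input at the first step still influences the later steps through the state.
print(rnn_states([1.0, 0.0, 0.0]))
print(rnn_states([0.0, 0.0, 0.0]))
```

In the first call, the later outputs are nonzero even though the later inputs are zero, which is the "memory" that makes the model suited to sequences.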


A.7.1 Notes 


This question is one of the many you might get during an interview that are designed to 
see whether you can explain a complex idea in a simple way. What algorithm you 
choose for your answer isn’t nearly as important as being able to express how it works 
clearly. That said, this question is a great opportunity to highlight interesting past work 
you've done by expressing an algorithm that relates to the work and talking about it. 


A.7.2 Example answer 


Standard model metrics like R-squared or accuracy can miss the end-user or business 
perspective. A classification model could be right 99% of the time, but the 1% of the 
time it’s wrong, it’s such a problem for the business that the model would never be used. 


I find that the best way to evaluate a model is to try running an experiment with it. If I’m 
creating a model to cluster customers into segments, for example, I would present the 
clusters to marketing and have them try to do a test run of custom marketing to a 
sample set of customers from the different segments. I would compare how well the 
marketing performs with and without the customers segmented, and if there is a 
meaningful improvement, the model is a success. That’s totally different from using 
metrics about the model itself, such as how effectively it performs the segmentation, 
because those sorts of measures analyze only the model. Here, I’m actually analyzing 
how it performs compared with no model at all. 


The downside of running an experiment with the model is that it’s often difficult to set up 
the experiment. Sometimes, you can’t split your customers into ones who get the model 
and ones who don't. At other times, the effect of the model is so small that it wouldn’t 
show up in any KPIs that are easy to measure. But despite these difficulties, if it’s 
possible to run an experiment, that’s almost always the best approach. 


A.7.2 Notes 


This question is tricky because it’s very general, but to answer it, you need to talk about 
specifics. Your answer could vary dramatically for a predictive model versus an 
unsupervised model, or for work with the marketing department versus the operations 
department. You'll want to talk a lot about the idea that the statistical measures aren’t 
the same as the measures that the business cares about; junior data scientists can get 
overly focused on maximizing the statistical measures and ignore the business ones. 
But how you end up talking about these ideas is very open. As with many answers, if 
you can bring examples from your experiences, you can add a lot of depth. 


A.7.3 Example answer 


There are lots of different ways to answer the specifics of this question, but A/B tests 
generally follow this type of flow: 


1. Define what better means by picking the metric(s) you care about improving: active 
users, button clicks, impressions, and so on. 


2. Choose a null hypothesis based around your success metric, such as “Button clicks 
will be the same for all groups.” Use that hypothesis to run a power calculation, 
which will tell you how long you need to run the test for to detect a change of a 
certain size. 


3. Randomly split your population of app users into groups, and provide each group a 
different version of the app. 


4. After you’ve run the test for the length of time you decided on in step 2, evaluate 
whether you see a statistically significant difference between the two groups by 
using an appropriate statistical test (like a t-test). 
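
The steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a production design: the metric is a click rate, the power calculation uses a normal approximation and returns users per group (which you would convert to a test length from your traffic), and a pooled two-proportion z-test stands in for the t-test mentioned in step 4. The function names, click rates, and counts are all invented.

```python
import math
import random
from statistics import NormalDist


def sample_size_per_group(p_base, p_new, alpha=0.05, power=0.8):
    """Step 2: users needed per group to detect p_base -> p_new (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_new - p_base) ** 2)


def assign_groups(user_ids, seed=0):
    """Step 3: shuffle users and split them evenly into groups A and B."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Step 4: two-sided p-value for a difference in click rates (pooled z-test)."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (clicks_b / n_b - clicks_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Step 1 (assumed): the metric is click rate, and we care about a lift from 10% to 12%.
needed = sample_size_per_group(0.10, 0.12)
group_a, group_b = assign_groups(range(2 * needed))
p_value = two_proportion_p_value(clicks_a=100, n_a=1000, clicks_b=150, n_b=1000)
print(needed, len(group_a), len(group_b), p_value < 0.05)
```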


A.7.3 Notes 


Questions like this one are common for data science roles on teams that are heavily 
involved in media measurement, app/web development, and so on. The interviewer 
usually just wants to know that you understand the purpose and general principles of 
A/B testing, especially for more-junior roles. Rather than getting bogged down in the 
specifics of stat testing (such as when to use a chi-square test instead of a t-test), we 
recommend sticking to a clear high-level approach when answering to demonstrate that 
you know how to design an experiment and determine causality. 


A.7.4 Example answer 


You wouldn't want to implement the layout if you see it negatively affecting other 
important metrics (guardrail, or do-no-harm, metrics). An example might be a situation 
in which the metric you're testing for is user click-throughs, and although you do see a 
significant improvement in click-throughs for users exposed to the new layout, you also 
see pages in the app taking longer to load in that layout. In this case, the degradation in 
app performance may not be worth the increase in click-throughs, because over time, 
the worse in-app experience may drive users away. 


A.7.4 Notes 


This question is very open-ended. What the interviewer wants to see is your recognition 
that just finding a low p-value isn’t always a good-enough reason to consider an 
experiment successful. It’s risky for a company to make changes to a live product like 
an app or website, and a single statistical test usually doesn’t encapsulate all the 
information needed to make the right decision. Some other reasonable answers to this 
type of question include seeing too small an improvement relative to the cost and risk of 
changing the app or bias in the sampling/splitting methodology. 


A.7.5 Example answer 


Many types of bias can affect sampled data. One of the most common biases in 
practical data science applications is selection bias (selecting your sample incorrectly). 
Selection bias can happen in scenarios such as selecting a random group of customers 
from a transaction-level table, which overrepresents customers with multiple 
transactions. Other types of common bias include survivorship bias (the sample 
overrepresents a group that made it past some preselection process) and voluntary 
response bias (the sample overrepresents a group that was more likely to volunteer 
information about themselves). 


There are statistical methods that you can use to identify bias in a sample, like 
comparing the mean value from your sample with a known or expected mean of the 
population. You should also think rationally about the sampling process to identify 
biases, trying to answer this question: is there something about the way we’ve sampled 
this group that might make it different from the population we care about? 
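
The transaction-table example can be simulated to see how large the distortion gets. In the Python sketch below, the customer base is invented: every tenth customer has 20 transactions and everyone else has 1. Sampling rows from the transaction table is compared with sampling from the list of distinct customers, and each sample mean is checked against the known population mean, as described above.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical customers: every tenth one has 20 transactions, the rest have 1.
transactions_per_customer = {cid: (20 if cid % 10 == 0 else 1) for cid in range(1000)}
population_mean = mean(transactions_per_customer.values())  # 2.9

# Transaction-level table: one row (holding the customer ID) per transaction.
transaction_rows = [
    cid for cid, n in transactions_per_customer.items() for _ in range(n)
]

# Biased: sampling rows picks frequent buyers far more often than their share of customers.
biased_sample = rng.sample(transaction_rows, 500)
biased_mean = mean(transactions_per_customer[cid] for cid in biased_sample)

# Fairer: sample from the list of distinct customers instead.
fair_sample = rng.sample(sorted(transactions_per_customer), 500)
fair_mean = mean(transactions_per_customer[cid] for cid in fair_sample)

print(population_mean, biased_mean, fair_mean)
```

The row-level sample should report a mean several times larger than the population value, while the customer-level sample should land close to it.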


A.7.5 Notes 


This question is meant to test your understanding of limitations in working with data and 
drawing conclusions. It’s less important to understand specific terms, like selection 
bias and survivorship bias, than to understand the ways in which data can be limited or 
misleading. The interviewer wants to see that you understand the nuances of working 
with real-world data—all of which is biased in one way or another—and all the 
messiness that this data entails. Using data from an optional survey, for example, has a 
clear voluntary response bias. This doesn’t mean that the data is unusable, but it does 
mean that you should be aware of the bias, think about the consequences it has on your 
analysis, and take it into account in any conclusions you make. 


A.7.6 Example answer 


For me, the answer depends on a couple of factors in the environment. First, by what 
metric does the new model do better? Assuming that it’s overall accuracy, I’d check 
whether the model is sufficiently better that it’s worth swapping out the old model. If it’s 
only a percentage point better in accuracy, it may not be worth the effort of changing, 
because the effect might be negligible. Next, is there a risk to disrupting the current 
model? If the model was deployed using a well-maintained pipeline with clear logging 
and testing, I’d probably make the swap, but if it was deployed manually into a 
production system by a person who is no longer at the company, 
I’d probably hold off. 


Finally, is there a way to A/B test the model first? Ideally, I’d like to have the old and new 
models run in parallel so that I could test for any problems with the new model or edge 
cases missed by it. No test system can cover everything from production, so being able 
to have it running for a select set of customers or inputs first would be ideal. 


A.7.6 Notes 


Deploying a model is often a labor-intensive and risky proposition for a company. This 
question determines whether you understand what that’s like and how you would 
approach the situation. A more-junior data scientist or machine learning engineer may 
feel that the right choice is to deploy the most accurate model as quickly as possible, 
but there are risks that need to be managed. If you have any experiences you can draw 
on (such as model deployments failing), this question is a great place to mention them. 
If you haven't, that’s totally fine; just try to describe what you think might go wrong. 